(54) Method and system to obtain data for information retrieval

(57) To obtain a query for use in information 
retrieval, a document is scanned (S4). The resulting text 
image data define an image of a segment of text in a 
first language. Automatic recognition is then performed 
(S6, S8, S10) on at least part of the text image data to
obtain text code data including a series of element 
codes. Each element code indicates an element that 
occurs in the first language, and the series of element 
codes defines a set of expressions that also occur in the 
first language. Automatic translation is then performed 
(S20) on a version of the text code data to obtain trans- 
lation data indicating a set of counterpart expressions in 
a second language. The counterpart expressions are 
used to automatically obtain query data defining the 
query (S22). The query can then be provided to an 
information retrieval engine (S24). 
Description

[0001] The present invention relates to obtaining 
query data for information retrieval. 
[0002] Most multilingual speakers can read some lan- 
guages more easily than they can generate correct 
utterances and written expressions in those languages. 
When searching for information, existing information 
retrieval systems require that the user formulate a query 
in the language (target language or L2) of the docu- 
ments and, normally, physically type in the query. Thus, 
as well as including a query formulation step, such sys- 
tems do not allow a user to indicate their search inter- 
ests in their native language (L1). 
[0003] Ballesteros, L., and Croft, W.B., "Dictionary
Methods for Cross-Lingual Information Retrieval", in
Proceedings of the 7th International DEXA Conference
on Database and Expert Systems, 1996, pp. 791-801, 
disclose techniques in which a user can query in one 
language but perform retrieval across languages. Base 
queries drawn from a list of text retrieval topics were 
translated using bilingual, machine-readable dictionar- 
ies (MRDs). Pre-translation and post-translation feed- 
back techniques were used to improve retrieval 
effectiveness of the dictionary translations. 
[0004] EP-A-725,353 discloses a document retrieval 
and display system which retrieves source documents 
in different languages from servers linked by a commu- 
nication network, translates the retrieved source docu- 
ments as necessary, stores the translated documents, 
and displays the source documents and translated doc- 
uments at a client device connected to the communica- 
tion network. 

[0005] US-A-5,748,805 discloses a technique that 
provides translations for selected words in a source 
document. An undecoded document image is seg- 
mented into image units, and significant image units 
such as words are identified based on image character- 
istics or hand markings. For example, a user could mark 
difficult or unknown words in a document. The signifi- 
cant image units are then decoded by optical character 
recognition (OCR) techniques, and the decoded words 
can then be used to access translations in a data base. 
A copy of the document is then printed with translations 
in the margins opposite the significant words. 
[0006] The invention addresses a problem that arises 
with information retrieval where a user has a document 
in one language (L1) and wishes to access pertinent 
documents or other information written in a second lan- 
guage (L2) and accessible through a query-based sys- 
tem. Specifically, the invention addresses the problem 
of generating a query that includes expressions in the 
second language L2 without translating or retyping the 
document in the first language L1, referred to herein as
the document-based query problem. The document- 
based query problem arises, for example, where the 
user cannot translate the document from L1 to L2,
where the user is unable to type or prefers not to type,
where the user does not have access to a machine with 
a keyboard on which to type, or where the user does not 
know how to generate a query that includes expres- 
sions in L2. 

[0007] The invention alleviates the document-based
query problem by providing a new technique that scans 
the document and uses the resulting text image data. 
The new technique performs automatic recognition to 
obtain text code data with a series of element codes
defining expressions in the first language. The new
technique performs automatic translation on a version 
of the text code data to obtain translation data indicating 
counterpart expressions in the second language. The 
new technique uses the counterpart expressions in the
second language to automatically obtain query data
defining a query for use in information retrieval. 
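By way of illustration only, the chain of recognition, translation and query formation described above might be sketched in Python as follows. The function names, the toy dictionary and the stand-in stages are assumptions of this sketch and are not taken from the disclosure; a real system would substitute OCR, a bilingual dictionary and an engine-specific formatter.

```python
from typing import Callable, List


def document_to_query(
    text_image: bytes,
    recognize: Callable[[bytes], str],
    translate: Callable[[str], List[str]],
    format_query: Callable[[List[str]], str],
) -> str:
    """Chain recognition, translation and query formatting."""
    text_code_data = recognize(text_image)    # element codes in L1
    counterparts = translate(text_code_data)  # expressions in L2
    return format_query(counterparts)         # query data


# Toy stand-ins (hypothetical) so that the sketch runs end to end.
def fake_ocr(image: bytes) -> str:
    return image.decode("ascii")


def fake_translate(text: str) -> List[str]:
    toy_dictionary = {"virus": ["virus"], "infecter": ["infect", "poison"]}
    words: List[str] = []
    for token in text.lower().split():
        words.extend(toy_dictionary.get(token, []))
    return words


def space_join(words: List[str]) -> str:
    return " ".join(words)


query = document_to_query(b"virus infecter", fake_ocr, fake_translate, space_join)
print(query)
```

Passing the stages in as callables keeps the sketch independent of any particular recognition or translation tool.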
[0008] The new technique can be implemented with a 
document that is manually marked to indicate a seg-
ment of the text, and text image data defining the indi-
cated segment can be extracted from image data
defining the document. 

[0009] Automatic recognition can be implemented 
with optical character recognition (OCR), and automatic 
language identification can be performed to identify the
probable predominant language so that language-spe-
cific OCR can be performed. The OCR results can also
be presented to the user, who can interactively modify 
them to obtain the text code data. 
[0010] Automatic translation can be implemented with
a translation dictionary. The text code data can be
tokenized to obtain token data; the token data can be 
disambiguated to obtain disambiguated data with parts 
of speech for words; the disambiguated data can be 
lemmatized to obtain lemmatized data indicating, for
each of a set of words, either the word or a lemma for
the word; and the lemmatized data can be translated. 
Translation can be done by looking up the words and 
lemmas in a bilingual translation dictionary. 
[0011] The query data can define the query in a format
suitable for an information retrieval engine. The query
data can then be provided to the information retrieval 
engine. 

[0012] The new technique can also be implemented in 
a system that includes a scanning device and a proces-
sor connected for receiving image data from the scan-
ning device. After receiving an image of a segment of
text in the first language from a scanned document, the 
processor performs automatic recognition to obtain text 
code data, performs automatic translation on a version
of the text code data to obtain translation data indicating
expressions in the second language, and uses the 
expressions to automatically obtain query data defining 
a query for use in information retrieval. 
[0013] An advantage of the invention is that it elimi-
nates the need for knowing how an information interest (or
query) should be formulated in the target language, as
well as eliminating the need for imagining and typing in
the query. In certain embodiments of the invention, the
user need only designate a portion of an existing docu-
ment, e.g. a hardcopy document, which is of interest to
him or her.

[0014] The invention will now be described, by way of 
example, with reference to the accompanying drawings, 
in which: 

Figure 1 schematically illustrates an exemplary 
computer network that may be employed in using a 
document in a first language to obtain a query with 
expressions in a second language for use in infor- 
mation retrieval; 

Figure 2 is a schematic block diagram of the multi- 
function device in the network of Fig. 1; 
Figures 3A, 3B, and 3C together are a flow diagram 
schematically illustrating detailed acts that can be 
performed in using a document in a first language 
to obtain a query with expressions in a second lan- 
guage for use in information retrieval; 
Figure 4 shows a document from which a user 
wishes to isolate a portion of text; 
Figure 5 depicts the text portion isolated by the 
user; 

Figure 6 shows the result of OCR in Fig. 3A on the 
text portion of Fig. 5; 

Figure 7 illustrates the results of the tokenization, 
morphological analysis and part-of-speech tagging 
(disambiguation) in Figs. 3A and 3B; 
Figure 8 shows the results of the lemmatization in 
Fig. 3B on the disambiguated text of Fig. 7;
Figure 9 depicts the results of bilingual on-line dic-
tionary look-up in Fig. 3B performed on the lemmas
of Fig. 8; 

Figure 10 shows a textual query resulting from the 
formatting operation of Fig. 3C on the text in lan- 
guage L2 derived from the results in Fig. 9; 
Figure 11 illustrates a list returned by the informa- 
tion retrieval engine of titles of documents matching 
the query of Fig. 10, in ranked order; and 
Figure 12 shows the display of (the first page of) a 
document in the list of Fig. 11 following the selec-
tion of the document by a user.

[0015] FIG. 1 schematically depicts an example of a 
computational system or network 110 suitable as a 
basis for implementing embodiments of the invention: 
this is discussed further in US-A-5,692,073. System 
110 includes a fax machine 120, a "smart" multifunction
device (MFD) 130 (that is, a multifunction device incor- 
porating a processor (CPU) and memory), a personal or 
office computer 100, one or more local server comput- 
ers 140, and one or more World Wide Web server com- 
puters 150. These are connected by various 
communications pathways including telephone connec- 
tions 111, a local area network 141, and the Internet 
151. Computer 100 includes a modem 108 and option- 
ally a CD-ROM mass storage device 109, and has 
attached peripherals including an optical scanner 103
and a printer 104.

[0016] Persons of skill in the art will appreciate that the
design of system 110 is intended to be illustrative, not
restrictive. In particular, it will be appreciated that a wide
variety of computational, communications, and informa-
tion and document processing devices can be used in
place of or in addition to the devices 120, 130, 140, 150,
and 100 shown in system 110. Indeed, connections
through the Internet 151 generally involve packet
switching by intermediate router computers (not
shown), and computer 100 is likely to access any
number of Web servers 150 during a typical Web brows-
ing session. Also, the devices of system 110 can be
connected in different ways. For example, printer 104 is
shown as being an attached peripheral of computer
100, but it could also be a networked printer, accessed
via local area network 141 through a print server that is
one of the local servers 140.

[0017] The various communication pathways 111,
141, 151 in system 110 allow the devices 120, 130, 140,
150, 100 to communicate with one another. Telephone
connections 111 allow fax machine 120 to communicate
with MFD 130, and also with computer 100 by way of
modem 108. Local area network 141 allows computer
100 to communicate with local server(s) 140. The Inter-
net 151 allows MFD 130 and computer 100 to commu-
nicate with Web server(s) 150.
[0018] A wide variety of possibilities exists for the rel-
ative physical locations of the devices in system 110.
For example, fax machine 120 and MFD 130 can be in
the same building as each other or around the globe
from one another, and either or both can be in the same
building as computer 100 or around the globe from com-
puter 100. Web server(s) 150 can likewise be at local
(so-called "Intranet") or remote sites with respect to
computer 100 and MFD 130. The distance between
computer 100 and local server(s) 140, of course, is lim-
ited by the technology of local area network 141.
[0019] A user or users can access system 110 at var-
ious points and in various ways. For example, a user
can provide inputs to and receive outputs from system
110 through fax machine 120, through MFD 130, or
through the scanner 103 and printer 104 of computer
100. In particular, a user who is near fax machine 120
can send a fax from fax machine 120 to computer 100,
and computer 100 (which may be suitably programmed
with Formless Forms PUI software) can automatically
send a fax back to the user at fax machine 120. Simi-
larly, the user can send a fax from fax machine 120 to
MFD 130 and MFD 130 (likewise assumed to be suita-
bly programmed) can automatically send a fax back to
the user at fax machine 120. A user who is near compu-
ter 100 can interact with computer 100 through its PUI in
conjunction with scanner 103 and printer 104. A user
who is near MFD 130 can interact with MFD 130
through its scanning and printing capabilities, thereby
using MFD 130 as a kind of personal computer, a com-
puter having a user interface that is primarily or even
exclusively paper-based. Finally, the user can interact
with Web server(s) 150 by browsing the Web. This can
be done directly from computer 100 or MFD 130, or indi-
rectly from fax machine 120 by way of either computer
100 or MFD 130.

[0020] Figure 2 is a block diagram of a multifunction 
device (MFD) which may be employed in the implemen- 
tation of the present invention: this is discussed further 
in EP-A-741,487. The block diagram of Fig. 2 illustrates
a MFD 222, which enables a user of a personal compu- 
ter 220 to move easily between paper and electronic 
representations of a document. The MFD 222 prints
documents, copies documents, and transmits and
receives facsimile documents. MFD 222 performs
these tasks via multifunction controller 224, fax modem 
226, scanner 230, and printer 228. Though not shown, 
MFD 222 may also include an additional display device 
such as a CRT or LCD display. Multifunction controller 
224 controls the operation and cooperation of input/out- 
put devices 226, 228 and 230 using multifunction oper- 
ating system 232. The multifunction operating system 
232 selects appropriate command sequences, which it 
passes to processor 234 for execution. Multifunction 
operating system 232 may be realized as software 
stored within a memory device, and may be, for exam-
ple, Microsoft at Work™.

[0021] Fax modem 226, scanner 230, printer 228, net-
work port 221, and multifunction controller 224 repre-
sent the documents that they handle using scan line
signals. Scanner 230 generates scan line signals from 
the images on a hard copy document, while printer 228 
marks images on a marking medium using scan line 
signals. Fax modem 226 and multifunction controller 
224 use scan line signals received from the PC 220, a 
network port 221, telephone lines, the printer 228, or the
scanner 230 to enable movement of information 
between electronic media and paper. The functionality 
of the multifunction operating system 232 is enhanced 
by calls to additional processes, including those accord- 
ing to embodiments of the present invention. Those 
processes are preferably realized using instructions 
executed by the processor 234 and stored in object 
code form within a memory 236. The memory 236 can 
be realized using solid state memory devices such as 
ROM, RAM, DRAM, PROM, EPROM and EEPROM.
[0022] It will be apparent to persons skilled in the art 
that where references are made herein to steps, opera- 
tions or manipulations involving characters, words, pas- 
sages of text, etc., these are implemented, where 
appropriate, by means of software controlled processor 
operations upon machine readable (e.g. ASCII code) 
representations of such characters, words and text. 
Similarly, references to steps, operations, or manipula- 
tions involving images, image segments or documents 
can be implemented, where appropriate, by means of 
software controlled processor operations upon data 
representations of such images, image segments, or 
documents such as would be produced by any of the
scanning devices in system 110, whether scanner 103,
fax machine 120, or MFD 222. In either case, the proc- 
essor could be any of the processors in system 110, 
whether a processor in fax machine 120, a central 
processing unit (CPU) or other processor in computer
100 or computer 220, a processor in web server(s) 150,
or processor 234 in MFD 222. 

[0023] Figure 3 is a flow diagram schematically illus-
trating acts in an implementation of the invention. As
seen in box S2 in Fig. 3A, initially, the user can manually
isolate the portion of text that he or she wishes to use as
the basis of a multilingual search. Figure 4 shows a doc-
ument (illustratively part of the front page of a newspa-
per, though any document could be used) from which a
user wishes to isolate a portion of text: here, the chosen
portion is the text portion 2, which is an article in the
newspaper, the language L1 is French, and the user iso-
lates the text portion 2 by cutting it out of the newspaper,
as shown in Fig. 5. The act in box S2 can thus comprise
scanning the text portion 2 alone. The user then places
the isolated portion 2 on the platen of MFD 222 or
another scanning device for scanning. After image data
defining an image of text portion 2 is obtained, the image
data can be provided to processor 234 of MFD 222, as
shown in box S4, or to another processor. The image
data, which may take the form of a file, may be supplied
to the processor directly, via a network connection, or
through another connection.

[0024] Alternatively, the user may isolate the text por-
tion 2 by drawing a marking 4 around the text portion 2.
In this case, the acts in boxes S2 and S4 are replaced
by the user making the marking 4 and scanning the
document in Fig. 4 using MFD 222 or another scanning
device. Processor 234 of MFD 222 or another processor
could then extract the marked portion 2, based on the
marking 4, using conventional image processing techniques.
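One simple form such extraction could take is cropping the page image to the bounding box of the marking. The following Python sketch assumes that the pixels belonging to marking 4 have already been detected and are supplied as a binary mask of the same size as the page; the detection step, the `Image` type and the function names are assumptions of this sketch and are not part of the disclosure.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of pixel values


def bounding_box(mask: Image) -> Tuple[int, int, int, int]:
    """Return (top, left, bottom, right) of the non-zero pixels in mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        raise ValueError("no marking found")
    return rows[0], min(cols), rows[-1], max(cols)


def crop(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Cut the boxed region (inclusive bounds) out of the image."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]


# Hypothetical 4x4 page and a mask covering its central 2x2 region.
page = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
marking = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]

box = bounding_box(marking)
segment = crop(page, box)
print(box, segment)
```

The cropped region includes the marking itself; a practical implementation would typically shrink the box or erase the marking before recognition.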
[0025] The user could alternatively highlight the text 
portion 2 using a suitable ink prior to the scanning by 
MFD 222 or another scanning device, and processor 
234 of MFD 222 or another processor could then extract
text which has been thus highlighted. Techniques for 
performing this type of extraction are described in more 
detail in US-A-5,272,764. 

[0026] The scanned image could alternatively be pre-
sented to the user, such as on the display of a computer,
and the user could interactively isolate text portion 2, 
such as with a mouse or other pointing device or by suit- 
able interaction with a touchscreen or other user input 
device. 

EP 0 927 939 A1

[0027] Once the isolated text portion 2 has been
scanned and/or extracted, and image data defining text
portion 2 has been provided to processor 234 of MFD
222 or another processor, the processor can perform a
conversion of the image data generated by the scanning
or extracting operation to codes such as ASCII, as
shown in box S6, with each code representing a charac-
ter or other element that occurs in the language of text
portion 2. The conversion can be performed using
known optical character recognition (OCR) technology,
such as ScanworX or TextBridge, available from ScanSoft
Corporation of Peabody, Massachusetts. OCR in
box S6 could be preceded by character set recognition,
and OCR could be performed in a manner that is appro-
priate for the recognized character set but is not lan-
guage-specific. The conversion could alternatively be
performed using word recognition.
[0028] Figure 6 shows the result of OCR in Fig. 3A on 
the text portion 2 in Fig. 5. The sequence of characters 
illustrated in OCRed text 6 represents a series of element
codes that define expressions in the language of text 
portion 2, for which processor 234 of MFD 222 or 
another processor now has a file in ASCII format. At this 
stage the user may, provided MFD 222 or other proces- 
sor has a display and keyboard or other suitable user 
interface (UI), correct any apparent errors in OCRed
text 6. 

[0029] Returning to Fig. 3A, OCRed text 6 may next 
be subjected to a language guessing operation, as 
shown in box S8. If the language L1 of the text portion 2 
is not known in advance, the OCR operation in box S6 
may be sub-optimal. Language guessing techniques 
are discussed, for example, in Beesley K.R., "Language
Identifier: a program for automatic language identifica-
tion of on-line texts", in Languages at the Crossroads:
Proc. 29th Ann. Conf. Am. Translators Assoc. (12-16
October 1988), pp. 47-54, and in Grefenstette, G.,
"Comparing Two Language Identification Schemes,"
JADT 1995, 3rd International Conference on Statistical 
Analysis of Textual Data, Rome, 11-13 Dec 1995, pp. 
263-268. The result of the optional language guessing 
operation in box S8 is to determine L1 or a language 
candidate — the language found to be the most likely for 
L1. Then, in box S10, OCR is performed once again on
the scanned image of portion 2 using a language (L1) 
specific OCR tool. Again, at this stage the user may, 
provided MFD 222 or other processor has a display and 
keyboard or other suitable UI, correct any apparent
errors in the OCRed text generated by the language 
specific OCR operation. 
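Purely as an illustration of language guessing in box S8, the Python sketch below scores a text against short lists of frequent function words and returns the best-scoring language. The word lists, language codes and function name are assumptions of this sketch; it is not the implementation of either reference cited above, and practical word lists would be derived from corpora and be much longer.

```python
from typing import Dict, Set

# Tiny illustrative lists of frequent short words (hypothetical).
SMALL_WORDS: Dict[str, Set[str]] = {
    "fr": {"le", "la", "les", "de", "des", "et", "un", "une", "est", "dans"},
    "en": {"the", "of", "and", "a", "an", "is", "in", "to", "that", "it"},
}


def guess_language(text: str) -> str:
    """Return the language code whose small words occur most often in text."""
    tokens = text.lower().split()
    scores = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in SMALL_WORDS.items()
    }
    return max(scores, key=scores.get)


print(guess_language("le virus est dans la machine"))
print(guess_language("the virus is in the machine"))
```

The language code returned could then be used to select the language-specific OCR tool of box S10.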

[0030] In box S12, OCRed text 6 can be tokenized, 
using conventional techniques such as those described 
in McEnery T. and Wilson A., Corpus Linguistics (1996),
Edinburgh Press, and also in US-A-5,721,939 to Kaplan
and US-A-5,523,946 and US-A-5,325,091 to Kaplan et 
al. The result is tokenized text, meaning a text portion 
which is split up into tokens for further processing. 
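A minimal tokenizer for box S12 might look as follows in Python. The regular expression and the function name are assumptions of this sketch and do not reproduce the techniques of the references cited above; the pattern simply separates runs of letters, runs of digits and individual punctuation marks.

```python
import re
from typing import List

# A run of letters (accented letters included), a run of digits,
# or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+|\d+|[^\w\s]")


def tokenize(text: str) -> List[str]:
    """Split text into word, number and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Plus de virus, plus vite."))
```

A tokenizer intended for French would additionally need rules for elisions and hyphenated compounds, which this sketch does not attempt.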
[0031] With reference to box S14 in Fig. 3B, the 
tokens can be morphologically analyzed using a lan-
guage (L1) specific analyzer. Morphological analysis
using finite state transducer technology is discussed 
further in EP-A-583,083. 

[0032] Next, in box S16, the words obtained as a 
result of the morphological analysis can be subjected to 
part-of-speech disambiguation or tagging, as described 
in detail in de Marcken C.G., "Parsing the LOB Corpus",
28th Ann. Meeting of the ACL, Pittsburgh, 6-9 June
1990. See also McEnery T. and Wilson A., Corpus
Linguistics, Chapter 3 and Appendix B. 
[0033] Figure 7 illustrates the results of the tokeniza-
tion, morphological analysis and part-of-speech tagging
(disambiguation) steps in Fig. 3, as processed text 7.
This is illustrated by means of three columns 8, 10, 12
containing, respectively, the tokens derived from the
OCRed text 6, the morphs obtained by finite state
transducer (FST) morphological analysis, and the
part-of-speech tags applicable to
the word. Thus, for example, in the row designated 14,
the word (token) "Plus" is in the first column, the morph
"plus" in the second, and the tag "+ADV" (denoting
adverb) in the third.

[0034] Figure 8 shows the results of lemmatization in
box S18 in Fig. 3B on the disambiguated text 7 of Fig. 7.
Thus, for each word-morph-tag triple in the text 7,
the lemma (or dictionary headword form) is extracted or
the word itself is retained as the lemmatized form; see
the abovementioned EP-A-583,083 for disclosure of
how lemmatization may be implemented. The resulting
set of words (generally designated 18 in Fig. 8) is used
for subsequent processing.
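The choice between lemma and retained word can be sketched in Python as follows. The `Analysis` type, the tags other than "+ADV" and the rows other than row 14 are hypothetical and serve only to make the example run; the sketch does not reproduce the finite state technique of EP-A-583,083.

```python
from typing import List, NamedTuple, Optional


class Analysis(NamedTuple):
    token: str            # surface form from the OCRed text
    morph: Optional[str]  # base form from morphological analysis, if any
    tag: str              # part-of-speech tag


def lemmatize(analyses: List[Analysis]) -> List[str]:
    """Keep the base form where the analyzer found one, else the token."""
    return [a.morph if a.morph else a.token for a in analyses]


rows = [
    Analysis("Plus", "plus", "+ADV"),           # row 14 of Fig. 7
    Analysis("infectés", "infecter", "+VERB"),  # hypothetical row
    Analysis("Xerox", None, "+PROP"),           # hypothetical: no analysis
]

lemmas = lemmatize(rows)
print(lemmas)
```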

[0035] The acts in boxes S12 through S18 in Figs. 3A
and 3B are optional and could be replaced by any other
operations that would prepare the automatically recog-
nized codes for automatic translation.
[0036] For example, one could replace the sequence
of morphological analysis, part of speech tagging, and
lemmatization by language-specific stemming, as dis-
closed for English in Porter, M.F., "An algorithm for Suf-
fix Stripping", Program, Vol. 14, No. 3, 1980, pp. 130-
137. In this case the dictionary headwords would have
to undergo the same stemming processes before the
lookup in box S20, the results of which are depicted in
Fig. 9. This technique creates more noise than the
technique described in boxes
S12 through S18 since semantically different words are
sometimes stemmed to the same stem by techniques
such as those disclosed by Porter. For example "fac-
tory" and "factorial" are both stemmed to "factori" by the
Porter stemmer, which would mean that the dictionary
entries for both would be conflated by using Porter
stemming to replace the acts shown in boxes S12 to
S18.
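The conflation effect can be reproduced with a deliberately simplified stemmer. The Python sketch below uses only two suffix rules, which happen to yield "factori" for both example words; it is not the Porter algorithm, and the English-French dictionary entries are hypothetical.

```python
from typing import Dict, List


def toy_stem(word: str) -> str:
    """Two illustrative suffix rules; NOT the full Porter algorithm."""
    if word.endswith("al") and len(word) > 4:
        return word[:-2]
    if word.endswith("y") and len(word) > 2:
        return word[:-1] + "i"
    return word


def build_stemmed_dictionary(entries: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Key every headword by its stem, merging entries whose stems collide."""
    stemmed: Dict[str, List[str]] = {}
    for headword, translations in entries.items():
        stemmed.setdefault(toy_stem(headword), []).extend(translations)
    return stemmed


# Hypothetical dictionary entries.
entries = {"factory": ["usine"], "factorial": ["factorielle"]}
stemmed = build_stemmed_dictionary(entries)
print(stemmed)
```

After stemming, a lookup of either word returns the translations of both, which is the noise referred to above.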

[0037] Various similar techniques could be used
instead of Porter stemming.

[0038] Another alternative would be to apply a full
form generator to dictionary headwords, generating
duplicate dictionary entries for every possible form that
a word could take. For example, the dictionary entry for
the word "infect" would be duplicated as many times as
necessary in order to create dictionary entries for
"infects", "infected", and "infecting." With such a greatly
expanded dictionary, one could simply tokenize the
input text or tokenize and part-of-speech tag the text,
and look up using the word forms as they appear in the
text since there would then be a headword in the diction-
ary for every word form found in the text. This approach
however has the drawback of making the dictionary
much bigger, and would not be feasible for languages
such as Finnish or Arabic in which one word may have
hundreds of different string representations in unlem-
matized text.
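For the "infect" example, dictionary expansion might be sketched in Python as follows. The generator handles only regular English verb endings and the single dictionary entry is hypothetical; a real full form generator would cover irregular forms, spelling changes and other parts of speech.

```python
from typing import Dict, List


def full_forms(verb: str) -> List[str]:
    """Regular English verb forms only; a real generator handles far more."""
    return [verb, verb + "s", verb + "ed", verb + "ing"]


def expand_dictionary(entries: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Duplicate each entry under every generated form of its headword."""
    expanded: Dict[str, List[str]] = {}
    for headword, translations in entries.items():
        for form in full_forms(headword):
            expanded[form] = list(translations)
    return expanded


# Hypothetical English-French entry.
expanded = expand_dictionary({"infect": ["infecter"]})
print(sorted(expanded))
```

Even in this small case one entry becomes four, which indicates how quickly the dictionary would grow for a highly inflected language.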

[0039] Figure 9 depicts the results of bilingual on-line 
dictionary look-up in box S20 in Fig. 3B, performed on 
each of the lemmas in the text 18 of Fig. 8. Here the 
second or target language L2 is English. Therefore, for 
each lemma, an on-line French-English dictionary is 
consulted to obtain one or more corresponding transla-
tions in English, in a manner known in the art. Transla-
tion data indicating expressions in L2 could alternatively
be obtained through on-line Web sites or other products 
or services that provide automatic translation, and look- 
up could alternatively be performed using a bilingual 
database, parallel corpora, or a manually or automati- 
cally constructed bilingual lexicon constructed from par- 
allel corpora. 

[0040] For the sake of illustration in Fig. 9, the informa-
tion is presented in the following format: Lemma in L1 |
Translation word(s) in L2. Thus, by way of example, for
the sixth lemma "infecter", the translation words
"infected", "septic", "infect" and "poison" are returned.
The set of translation words in L2 (generally designated
20 in Fig. 9) is used as the basis for subsequent
processing. Again, at this stage the user may, provided
MFD 222 or other processor has a display and keyboard
or other suitable UI, intervene to eliminate any
unwanted translation words from among the set 20.
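The look-up and the optional user intervention can be illustrated with a small in-memory dictionary in Python. Only the "infecter" entry is taken from the example above; the second entry, the function names and the choice of which word the user removes are assumptions of this sketch.

```python
from typing import Dict, List

# Toy French-English dictionary; the "ordinateur" entry is invented.
FR_EN: Dict[str, List[str]] = {
    "infecter": ["infected", "septic", "infect", "poison"],
    "ordinateur": ["computer"],
}


def look_up(lemmas: List[str], dictionary: Dict[str, List[str]]) -> List[str]:
    """Collect translations per lemma, skipping unknown lemmas and duplicates."""
    translations: List[str] = []
    for lemma in lemmas:
        for word in dictionary.get(lemma, []):
            if word not in translations:
                translations.append(word)
    return translations


def remove_unwanted(translations: List[str], unwanted: List[str]) -> List[str]:
    """Drop the translation words that the user chose to eliminate."""
    return [word for word in translations if word not in unwanted]


found = look_up(["infecter", "inconnu", "ordinateur"], FR_EN)
kept = remove_unwanted(found, ["septic"])
print(found)
print(kept)
```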
[0041] In box S22 in Fig. 3C, the set 20 of translation 
words derived from the lemmas can be formatted into a 
classical information retrieval (IR) query in language L2. 
Figure 10 shows a textual query 22 defined by query 
data resulting from the formatting operation in box S22. 
The format of the query depends on the language of the 
monolingual IR engine being used. The query 22 may
be formatted for any suitable IR system, such as 
SMART (see Salton G., "The SMART retrieval system:
Experiments in Automatic Document processing", Pren- 
tice-Hall, Englewood Cliffs, NJ, 1971). Once formatted, 
the query 22 can be sent, in box S24, to the monolingual 
(L2) IR engine (at a suitable site on the network) for 
retrieving information corresponding to the query 22. A 
list of document titles relevant to the query can be 
received back from the IR engine in the conventional 
manner and, if the list is not already ranked in order of 
relevance, the list can be modified so that the docu- 
ments are so ranked, as in box S26, in the manner dis- 
closed in the abovementioned Salton reference. 
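As a rough illustration of boxes S22 and S26, the Python sketch below joins the translation words into a flat query string and orders returned titles by a relevance score. The space-separated query format, the score values and the titles are assumptions of this sketch; the actual input format of SMART or any other IR engine is not reproduced here.

```python
from typing import List, Tuple


def format_query(words: List[str]) -> str:
    """Join translation words into a flat bag-of-words query string."""
    return " ".join(words)


def rank_by_relevance(hits: List[Tuple[str, float]]) -> List[str]:
    """Return titles ordered from highest to lowest relevance score."""
    ordered = sorted(hits, key=lambda hit: hit[1], reverse=True)
    return [title for title, _ in ordered]


query_text = format_query(["infect", "poison", "computer"])

# Hypothetical (title, score) pairs as an engine might return them.
ranked = rank_by_relevance([("A", 0.2), ("B", 0.9), ("C", 0.5)])
print(query_text)
print(ranked)
```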
[0042] Figure 11 illustrates a list 24 returned by the
information retrieval engine, including titles of docu-
ments matching the query of Fig. 10, in ranked order. As
will be appreciated by persons skilled in the art, the
invention can be suitably implemented by means of
internet-based search techniques, and the list 24 of rel-
evant documents (or hits) can be suitably displayed in
HTML format by means of a conventional web browser.
The titles (14 of which are shown) suitably provide links
to the documents themselves and, as is conventional,
the document itself may be retrieved and viewed by the
user selecting (e.g. using a mouse cursor; but equally
by keyboard entry, pull-down menu or selection via
touchscreen) one of the titles or links in the list 24. Here,
the user is interested in the second listed document
"Technology Brief...", and has selected it for display by
clicking with a mouse cursor on link 26, or by using any
of the aforementioned selection methods.

[0043] Figure 12 shows the display of (the first page
of) a document 28 in the list 24 of Fig. 11 following the
selection of the document by a user. The document can
thus be viewed. As is conventional, the user may print
out the document 28 via a mouse click on print button
30.

[0044] However, it will be appreciated that MFD 222
or another device at which the list 24 is obtained may be
suitably programmed to automatically print out the list
24 itself, all of the documents on the list or a predeter-
mined number N of the documents on the list 24, as in
box S28 in Fig. 3C.

Claims 

1. A method of using documents with text to obtain
data for use in information retrieval; the method
comprising:

(A) scanning a document that includes text in a
first language to obtain text image data defining
an image of a segment of the text;

(B) performing automatic recognition on at
least part of the text image data to obtain text
code data; the text code data including a series
of element codes, each indicating an element
that occurs in the first language; the series of
element codes defining a first set of expres-
sions, each of which occurs in the first lan-
guage;

(C) performing automatic translation on a ver-
sion of the text code data to obtain translation
data; the translation data indicating a second
set of expressions; each of the second set of
expressions being a counterpart in the second
language of one or more of the first set of
expressions; and

(D) using the second set of expressions to
automatically obtain query data defining a
query for use in information retrieval.

2. The method of claim 1 in which the document
includes manual markings indicating the segment
of the text and in which (A) comprises:

scanning the document to obtain document
image data defining an image of the document
including the text; and
using the document image data to obtain the
text image data by extracting the segment indi-
cated by the manual markings.

3. The method of claim 1 in which (B) comprises:

performing optical character recognition on at
least part of the text image data; the element
codes including character codes indicating
characters that occur in the first language.

4. The method of claim 3 in which (B) further com-
prises:

performing automatic language identification to
obtain a language identifier indicating a candi-
date language that is likely to be the predomi-
nant language of the segment of the text;
the optical character recognition being specific
to the candidate language.

5. The method of claim 3, further comprising, after (B):

presenting the elements indicated by the series of element
codes to a user;
receiving signals from the user indicating modifications of the
presented elements; and
modifying the series of element codes in accordance with the
signals from the user to obtain the version of the text code
data on which automatic translation is performed.

6. The method of claim 1 in which (C) comprises:

(C1) using the version of the text code data to access a
translation dictionary with each of the first set of
expressions; the translation dictionary providing the
translation data in response.

7. The method of claim 6 in which the sequence of element codes
defines a first set of words that occur in the first language
and in which (C1) comprises:

(C1a) tokenizing the text code data to obtain token data
indicating tokens that occur in the sequence of element codes,
the tokens including the first set of words;
(C1b) disambiguating the token data to obtain disambiguated
data; the disambiguated data including, for each of the first
set of words, a part-of-speech indicator indicating the word's
part of speech;
(C1c) lemmatizing the disambiguated data to obtain lemmatized
data; the lemmatized data indicating, for each of the first set
of words, either the word or a lemma for the word; and
(C1d) translating the words and lemmas indicated by the
lemmatized data to obtain the translation data, the translation
data indicating possible counterparts in the second language
for a subset of the words and lemmas indicated by the
lemmatized data.

8. The method of claim 7 in which (C1d) comprises looking up the
words and lemmas indicated by the lemmatized data in a bilingual
translation dictionary to obtain counterparts in the second
language.

9. The method of claim 1 in which the query data define the
query in a format suitable for an information retrieval engine;

the method further comprising:

(E) providing the query data to the information retrieval
engine.

10. A system for using documents with text to obtain data for
use in information retrieval; the system comprising:

a scanning device (103; 120; 230) for scanning documents and
providing image data;
a processor (100; 234) connected for receiving image data from
the scanning device; after receiving text image data defining
an image of a segment of text in a first language from a
document scanned by the scanning device, the processor
operating to:

perform automatic recognition on at least part of the text
image data to obtain text code data; the text code data
including a series of element codes, each indicating an
element that occurs in the first language; the series of
element codes defining a first set of expressions, each of
which occurs in the first language;
perform automatic translation on a version of the text code
data to obtain translation data; the translation data
indicating a second set of expressions; each of the second set
of expressions being a counterpart in the second language of
one or more of the first set of expressions; and
use the second set of expressions to automatically obtain
query data defining a query for use in information retrieval.
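The dictionary-based translation of claims 6 to 8, followed by the query formation of claim 9, can be sketched as below. This is only an illustration under stated assumptions: `POS_LEXICON`, `LEMMAS` and `BILINGUAL` are hypothetical toy resources, a context-free lexicon lookup stands in for real part-of-speech disambiguation, and the OR-joined string is an assumed query format rather than the syntax of any particular retrieval engine.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical toy resources; a real system would use full lexicons.
POS_LEXICON: Dict[str, str] = {
    "ces": "DET",
    "traitements": "NOUN",
    "sont": "VERB",
    "disponibles": "ADJ",
}
LEMMAS: Dict[Tuple[str, str], str] = {
    ("traitements", "NOUN"): "traitement",
    ("sont", "VERB"): "etre",
    ("disponibles", "ADJ"): "disponible",
}
BILINGUAL: Dict[str, List[str]] = {
    "traitement": ["treatment"],
    "disponible": ["available"],
    "pharmacie": ["pharmacy", "drugstore"],
}


def tokenize(text_code: str) -> List[str]:
    """(C1a) Split the text code data into lower-cased word tokens."""
    return re.findall(r"\w+", text_code.lower())


def disambiguate(tokens: List[str]) -> List[Tuple[str, str]]:
    """(C1b) Attach a part-of-speech indicator; 'UNK' when the word is unknown."""
    return [(token, POS_LEXICON.get(token, "UNK")) for token in tokens]


def lemmatize(tagged: List[Tuple[str, str]]) -> List[str]:
    """(C1c) Give the lemma for each word, or the word itself if none is known."""
    return [LEMMAS.get((word, pos), word) for word, pos in tagged]


def translate(lemmas: List[str]) -> List[str]:
    """(C1d) Look up counterparts; entries missing from the dictionary are dropped."""
    counterparts: List[str] = []
    for lemma in lemmas:
        counterparts.extend(BILINGUAL.get(lemma, []))
    return counterparts


def build_query(counterparts: List[str]) -> str:
    """Join distinct counterparts, in first-seen order, into an OR query."""
    return " OR ".join(dict.fromkeys(counterparts))


text = "Ces traitements sont disponibles en pharmacie"
query = build_query(translate(lemmatize(disambiguate(tokenize(text)))))
```

Only a subset of the words and lemmas yields counterparts, which matches the wording of (C1d): words with no dictionary entry simply contribute nothing to the query.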
13 300 PERSONNES infectées
par le virus du sida suivent actuellement
en France
qui associe plusieurs médicaments
dont une molécule antiprotéase.
Dès le 1er janvier 1997, ces traitements
devraient être disponibles
en pharmacie. La polémique du
printemps dernier est ainsi ou-

de rationnement, le Conseil national
du sida avait envisagé de recourir
à un tirage au sort des malades.
Si elles demeurent des traitements
expérimentaux dont l'efficacité ne

les trithérapies semblent d'ores et
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900618-0050. 
</DOCID> 
<HL> 

Technology Brief — ICN Pharmaceuticals Inc.: 

Ireland Authorizes Using 

Ribavirin on HIV Patients 
</HL> 
<DATE> 
06/18/90 
</DATE> 
<SO> 

WALL STREET JOURNAL (J) , PAGE B4 
</SO> 

<CO>

ICN 
</CO> 
<IN> 

DRUG MANUFACTURERS (DRG) 
</IN> 
<LP> 

ICN Pharmaceuticals Inc. said Ireland's Department of 
Health authorized the company to market the anti-viral d 
ribavirin for use in the "management" of patients infect 
with the human immunodeficiency virus and other conditio 
related to acquired immune deficiency syndrome. 

The approval, the Costa Mesa, Calif., company said, m 
the first time that a country has authorized the anti-vi 
drug specifically for the treatment of HIV-infected pati 
Ribavirin already is used in the U.S. and in 16 other na 
for the treatment of infants and young children with sev 
respiratory tract infections caused by respiratory syncy 
virus. Ribavirin is used in more than 30 additional coun 
to treat other viral infections. 
</LP> 
<TEXT> 

ICN recently withdrew its request for U.S. government 
approval to market ribavirin for the treatment of the HI 
virus 
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